From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning
Video captioning is, in essence, a complex natural process that is affected
by various uncertainties stemming from video content, subjective judgment, etc.
In this paper we build on recent progress in using the encoder-decoder
framework for video captioning and address what we find to be a critical
deficiency of existing methods: most decoders propagate deterministic hidden
states, and such complex uncertainty cannot be modeled efficiently by
deterministic models. We propose a generative approach, referred to as
multi-modal stochastic RNNs (MS-RNN), which models the uncertainty observed in
the data using latent stochastic variables. MS-RNN can therefore improve the
performance of video captioning and generate multiple sentences to describe a
video under different random factors.
Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with
both visual and textual features to capture a high-level representation. Then,
a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty
propagation by introducing latent variables. Experimental results on the
challenging datasets MSVD and MSR-VTT show that our proposed MS-RNN approach
outperforms state-of-the-art video captioning methods.
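The idea of drawing latent stochastic variables can be illustrated with a minimal sketch. The abstract does not state how the variables are sampled, so the Gaussian reparameterization below and the names `sample_latent`, `mu`, and `log_var` are assumptions made for illustration, not the MS-RNN formulation.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    Assumed Gaussian latent; sigma is recovered from the log-variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

# Hypothetical mean and log-variance, as a decoder step might produce them.
rng = random.Random(0)
mu = [0.0, 1.0]
log_var = [0.0, -2.0]

# Each draw differs, which is what allows several outputs for one input.
samples = [sample_latent(mu, log_var, rng) for _ in range(3)]
```

Under these assumptions, feeding each sample to a decoder would give a different caption for the same video.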
Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Recent progress has been made in using attention-based encoder-decoder
frameworks for video captioning. However, most existing decoders apply the
attention mechanism to every generated word, including both visual words (e.g.,
"gun" and "shooting") and non-visual words (e.g., "the", "a"). These
non-visual words can be easily predicted using a natural language model without
considering visual signals or attention. Imposing the attention mechanism on
non-visual words could mislead and decrease the overall performance of video
captioning. To address this issue, we propose a hierarchical LSTM with adjusted
temporal attention (hLSTMat) approach for video captioning. Specifically, the
proposed framework utilizes the temporal attention for selecting specific
frames to predict the related words, while the adjusted temporal attention is
for deciding whether to depend on the visual information or the language
context information. In addition, a hierarchical LSTM is designed to simultaneously
consider both low-level visual information and high-level language context
information to support the video caption generation. To demonstrate the
effectiveness of our proposed framework, we test our method on two prevalent
datasets, MSVD and MSR-VTT. Experimental results show that our approach
outperforms state-of-the-art methods on both datasets.
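The two roles described in this abstract, selecting frames and deciding how much to rely on visual versus language context, can be sketched as follows. The abstract gives no equations, so the dot-product scoring, the scalar sigmoid gate, and the names `temporal_attention` and `adjusted_context` are assumptions for illustration only, not the hLSTMat definition.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(frame_feats, query):
    """Weight frames by dot-product relevance to the decoder query (assumed scoring)."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, frame_feats))
               for d in range(dim)]
    return context, weights

def adjusted_context(visual_ctx, language_ctx, gate_logit):
    """Blend visual and language context with a sigmoid gate beta in (0, 1)."""
    beta = 1.0 / (1.0 + math.exp(-gate_logit))
    mixed = [beta * v + (1.0 - beta) * l
             for v, l in zip(visual_ctx, language_ctx)]
    return mixed, beta

# Two hypothetical frame features; the query is most similar to the first.
ctx, weights = temporal_attention([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
mixed, beta = adjusted_context(ctx, [0.0, 0.0], gate_logit=0.0)
```

In this sketch a gate near 0 would let a non-visual word such as "the" be predicted almost entirely from the language context, while a gate near 1 would favor the attended frames.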
Path Tracking of a Wheeled Mobile Manipulator through Improved Localization and Calibration
This chapter focuses on path tracking of a wheeled mobile manipulator designed for manufacturing processes such as drilling, riveting, or line drawing, which demand high accuracy. This problem can be solved by combining two approaches: improved localization and improved calibration. In the first approach, a full-scale kinematic equation is derived for calibrating each individual wheel’s geometrical parameters, as opposed to the traditional practice of treating them as identical for all wheels. To avoid the singularity problem in computation, a predefined square path is used to quantify the errors used for calibration, considering movement in different directions. Both a statistical method and an interval analysis method are adopted and compared for estimating the calibration parameters. In the second approach, a vision-based deviation rectification solution is presented to localize the system in the global frame through a number of artificial reflectors that are identified by an onboard laser scanner. An improved tracking and localization algorithm is developed to meet the high positional accuracy requirement, improve the system’s repeatability in the traditional trilateral algorithm, and solve the problem of pose loss in path following. The developed methods have been verified and implemented on the mobile manipulators developed by Shanghai University.
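For context, the traditional trilateral approach that this abstract refers to can be sketched as textbook 2D trilateration from three reflectors at known positions. This is not the chapter's improved algorithm; the function name `trilaterate` and the reflector coordinates are hypothetical, and noise handling is omitted.

```python
import math

def trilaterate(p1, p2, p3, d1, d2, d3):
    """Solve for (x, y) given three known points and measured distances.

    Subtracting the first circle equation from the other two yields a
    2x2 linear system, solved here with Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a11, a12 = 2.0 * (x2 - x1), 2.0 * (y2 - y1)
    a21, a22 = 2.0 * (x3 - x1), 2.0 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 + y2**2 - x1**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 + y3**2 - x1**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("reflectors are collinear; position is ambiguous")
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y

# Hypothetical reflector layout and a robot located at (1, 1).
reflectors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
true_pos = (1.0, 1.0)
dists = [math.dist(true_pos, r) for r in reflectors]
x, y = trilaterate(*reflectors, *dists)
```

With exact distances the sketch recovers the position; with noisy laser ranges the result varies from scan to scan, which is the kind of repeatability problem an improved algorithm would need to address.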
A note on pentavalent s-transitive graphs
A graph, with a group G of its automorphisms, is said to be (G,s)-transitive if G is transitive on s-arcs but not on (s+1)-arcs of the graph. Let X be a connected (G,s)-transitive graph for some s≥1, and let G_v be the stabilizer of a vertex v∈V(X) in G. In this paper, we determine the structure of G_v when X has valency 5 and G_v is non-solvable. Together with the results of Zhou and Feng [J.-X. Zhou, Y.-Q. Feng, On symmetric graphs of valency five, Discrete Math. 310 (2010) 1725–1732], the structure of G_v is completely determined when X has valency 5. For valency 3 or 4, the structure of G_v is known.
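For readers unfamiliar with the term, an s-arc is a sequence of s+1 vertices in which consecutive vertices are adjacent and no step immediately backtracks. The sketch below counts s-arcs by brute force; the function name `count_s_arcs` and the choice of the complete graph K6 as a valency-5 example are illustrative assumptions, not taken from the paper.

```python
def count_s_arcs(adj, s):
    """Count sequences (v0, ..., vs) with v[i] adjacent to v[i+1] and v[i] != v[i+2]."""
    def extend(path):
        if len(path) == s + 1:
            return 1
        total = 0
        for nxt in adj[path[-1]]:
            # Skip the step that would return to the previous vertex.
            if len(path) >= 2 and nxt == path[-2]:
                continue
            total += extend(path + [nxt])
        return total
    return sum(extend([v]) for v in adj)

# K6: every vertex is adjacent to the other five, so the valency is 5.
k6 = {v: [u for u in range(6) if u != v] for v in range(6)}
one_arcs = count_s_arcs(k6, 1)
two_arcs = count_s_arcs(k6, 2)
```

In any graph where every vertex has valency 5, each vertex starts 5·4^(s−1) s-arcs for s≥1, so K6 has 30 one-arcs and 120 two-arcs.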
DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning
Compositional Zero-shot Learning (CZSL) aims to recognize novel concepts
composed of known knowledge without training samples. Standard CZSL either
identifies visual primitives or enhances unseen composed entities, and as a
result, entanglement between state and object primitives cannot be fully
utilized. Admittedly, vision-language models (VLMs) could naturally cope with
CZSL through prompt tuning, but uneven entanglement drags the prompts into
local optima. In this paper, we take a further step to introduce
a novel Disentangled and Recurrent Prompt Tuning framework termed DRPT to
better tap the potential of VLMs in CZSL. Specifically, the state and object
primitives are deemed as learnable tokens of vocabulary embedded in prompts and
tuned on seen compositions. Instead of jointly tuning state and object, we
devise a disentangled and recurrent tuning strategy to suppress the traction
force caused by entanglement and gradually optimize the token parameters,
leading to a better prompt space. Notably, we develop a progressive fine-tuning
procedure that allows for incremental updates to the prompts, optimizing the
object first, then the state, and vice versa. Meanwhile, the state and object
are optimized independently, so clearer features can be learned, which
further alleviates the problem of entanglement misleading the optimization. Moreover, we
quantify and analyze the entanglement in CZSL and supplement entanglement
rebalancing optimization schemes. DRPT surpasses representative
state-of-the-art methods on extensive benchmark datasets, demonstrating
superiority in both accuracy and efficiency.
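The idea of tuning the object and the state in turn, rather than jointly, can be illustrated with a toy alternating update. This is only a sketch of alternating optimization on a made-up coupled quadratic loss; the loss, the function names, and the hyperparameters are assumptions, and it is not the DRPT training procedure.

```python
def toy_loss_grads(state, obj):
    """Gradients of the toy loss (s-2)^2 + (o-3)^2 + (s-2)(o-3).

    The cross term couples the two parameters, standing in for entanglement.
    """
    u, v = state - 2.0, obj - 3.0
    return 2.0 * u + v, 2.0 * v + u

def alternate_tune(state, obj, rounds=50, steps=5, lr=0.1):
    """Update one parameter at a time while the other stays frozen."""
    for _ in range(rounds):
        for _ in range(steps):  # object phase: state is frozen
            _, g_obj = toy_loss_grads(state, obj)
            obj -= lr * g_obj
        for _ in range(steps):  # state phase: object is frozen
            g_state, _ = toy_loss_grads(state, obj)
            state -= lr * g_state
    return state, obj

# Starting from zero, the alternating updates approach the minimum at (2, 3).
state, obj = alternate_tune(0.0, 0.0)
```

Each phase only ever moves one group of parameters, so the gradient of the frozen group cannot pull the active group's update in its own direction within that phase.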